Journal article
Automated detection of records in biological sequence databases that are inconsistent with the literature
MR Bouadjenek, K Verspoor, J Zobel
Journal of Biomedical Informatics | ACADEMIC PRESS INC ELSEVIER SCIENCE | Published: 2017
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”. Our experiments on the PubMed Ce..
Grants
Awarded by Australian Research Council
Funding Acknowledgements
The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.